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Abstract 



OO ' Nonparametric density estimation is of great importance when econometricians want to 

(N ■ 

psj [ model the probabihstic or stochastic structure of a data set. This comprehensive review 
^ ! summarizes the most important theoretical aspects of kernel density estimation and provides 
an extensive description of classical and modern data analytic methods to compute the 

X ■ 

^ . smoothing parameter. Throughout the text, several references can be found to the most 
up-to-date and cut point research approaches in this area, while econometric data sets are 
analyzed as examples. Lastly, we present SIZer, a new approach introduced by Chaudhuri 
and Marron (2000), whose objective is to analyze the visible features representing important 
underlying structures for different bandwidths. 

Keywords: nonparametric density estimation; SiZer; plug-in bandwidth selectors; cross- 
validation; smoothing parameter. 
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1 Introduction 



The field of econometrics focuses on methods that address the probabilistic or stochastic 
phenomena involving economic data. Modeling the underlying probabilistic structure of the 
data, i.e., the uncertainty of the process, is a crucial task, for it can be used to describe 
the mechanism from which the data was generated. Thus, econometricians have widely 
explored density estimation, both the parametric and nonparametric approaches, to identify 
these structures and then make inferences about the unknown "true models". A parametric 
model assumes that the density is known up to a finite number of parameters, while a 
nonparametric model allows great flexibility in the possible form, usually assuming that it 
belongs to some infinite collection of curves (differentiable with square integrable second 
derivatives for example). The most used approach is kernel smoothing, which dates back to 
Rosenblatt (1956) and Parzen (1962). The aim of this paper is to review the most import 
aspects of kernel density estimation, both traditional approaches and modern ideas. 

A large extent of econometric research concerning estimation of densities has shown 
that a well estimated density can be extremely useful for applied purposes. An interesting 
comprehensive review of kernel smoothing and its applications can be found in Bierens 
(1987). Silverman (1986) and Scott (1992) discuss kernel density estimation thoroughly, 
giving details about assumptions on the kernel weight, properties of the estimator such as 
bias and variance, and discusses how to choose the smoothness of the estimate. The choice of 
the smoothing parameter is a crucial issue in nonparametric estimation, and will be discussed 
in detail in Section HI 

The remainder of this paper is as follows. In Section [2] we describe the most basic and 
intuitive method of density estimation: the histogram. Then, in Section[3]we introduce kernel 
density estimation and the properties of estimators of this type, followed by an overview of 
old and new bandwidth selection approaches in Section |H Finally, SiZer, a modern idea for 
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accessing features that represent important underlying structures through different levels of 
smoothing, is introduced in Section [51 



2 The Histogram 



The grouping of data in the form of a frequency histogram is a classical methodology that 
is intrinsic to the foundations of a variety of estimation procedures. Providing useful visual 
information, it has served as a data presentation device, however, as a density estimation 
method, it has played a fundamental role in nonparametric statistics. 

Basically, the histogram is a step function defined by bin heights, which equal the pro- 
portion of observations contained in each bin divided by the bin width. The construction 
of the histogram is very intuitive, and to formally describe this construction, we will now 
introduce some notation. Suppose we observe random variables Xi, . . . ,X„ i.i.d. from the 
distribution function Fx, and that Fx is absolutely continuous with respect to a Lesbegue 
measure on M. Assume that the data points observed from a realization of the 

random variables Xi, . . . , Define the bins as Ij = [xq + jh, xq + (j + l)h),j = 1, . . . ,k, 
for a starting point xq- Note that 



where ^ G Ij and the last equality follows from the mean value theorem for continuous 
bounded functions. Intuitively, we can approximate the probability of X falling into the 
interval Ij by the proportion of observations in Ij, i.e.. 



Using the approximation in ([2]) and the equation in ([T]), the density function f{x) can be 






(2) 



3 



estimated by 

A(x) = ii^i^ = J_ ^ 1(3;, e /,) for X G (3) 

nh nh ^-^ 

1=1 

where 

{1 if X G /,• 
otherwise. 

The smoothness of the histogram estimate is controlled by the smoothing parameter /i, 
a characteristic shared by all nonparametric curve estimators. Choosing a small bandwidth 
leads to a jagged estimate, while larger bandwidths tend to produce over smoothed histogram 
estimates (see Hardle (1991)). Figure [U shows an example of two histograms of the same 
randomly generated data: the histogram on the left hand side was estimated with a small 
bandwidth and consequently has many bins, while the histogram on the right hand side 
was computed with a large bandwidth, producing a smaller number of bins. The choice of 
the bandwidth is discussed in more detail in Section |H Note that in practice, the choice of 
k will determine h or vice versa (a rule of thumb for the choice of k is the Sturges' rule: 
k = 1 + log2 n). 

When building a histogram, not only the bandwidth needs to be chosen, but also the 
starting point of each bin edge. These choices can produce different impressions of the shape, 
and hence different estimates. The bin edge problem is a disadvantage of the histogram not 
shared by other estimators, such as the kernel density estimator. Another disadvantage is 
that the histogram estimators are usually not smooth, displaying bumps that may have been 
observed only due to noise. 

3 Kernel Density Estimation 

In econometrics, kernel density estimation is also known as the Parzen- Rosenblatt window 
method. It is an approach that is rooted in the histogram methodology. The basic idea is to 
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Figura 1: Histogram estimate with small bandwidth (left) and large bandwidth (right) 

estimate the density function at a point x using neighboring observations. However, instead 
of building up the estimate according to bin edges, the naive kernel method (adaptively) 
uses each point of estimation x as the center of the bin of width 2h. To express it more 
transparently, consider the weight function 

if Ixl < 1 

(4) 

otherwise, 

called the kernel weight. Then, the kernel estimate (Rosenblatt (1956)) of f{x) is defined as 

1=1 ^ ^ 

This kernel density estimator is specifically called naive because the kernel weight used is 
simply a bin of width 2/i centered at x. See Silverman (1986) for a deeper discussion about 
this kind of estimator. 

Note that the estimator in ^ is an additive function of the kernel weight, inheriting 
properties such as continuity and differentiability. Hence, it is not continuous and has zero 
derivatives everywhere except on the jump points Xi ± h. Moreover, even with a good 
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choice of h, estimators that use weights as in (jl]) most often do not produce reasonable 
estimates of smooth densities. This is because the discontinuity of the kernel weight gi- 
ves the estimate function a ragged form, creating sometimes misleading impressions due 
to several bumps and constant estimates where few data points are observed. As an il- 
lustration, we consider the CEO compensation data in 2012, containing the 200 highest 
paid chief executives in the U.S. This data set can be obtained from the Forbes website 



http:/ /www.forbes.com/lists/2012/12/ceo-compensation-12_rank.html For a better visuali- 
zation of the plot, we excluded the number 1 in the ranking, with an income of US$131.19 
mil, as it was an outlier. 



o 

<M -\ 



c/3 

c 

■o O 



o 

O —I 



CEO Compensation in 2012 




Naive Kernel 

Epanech Kernel 



30 40 
1 -Year pay ($mil) 



60 



Figura 2: Estimated density of CEO compensation using the naive (solid line) and the Epa- 
nechnikov(dashed line) kernels. 



Figure [2] shows two density estimators: the solid line represents the naive estimator, while 
the dashed line represents a more adequate kernel type, called Epanechnikov, which will be 
described later. The density estimated by the naive kernel appears to have several small 
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bumps, which are probably due to noise, not a characteristic of the true underlying density. 
On the other hand, the Epanechnikov kernel is smooth, avoiding this issue. 

A usual choice for the kernel weight is a function that satisfies K{x)dx = 1. If 

moreover, it is assumed that i^' is a unimodal probability density function that is symmetric 

about 0, then the estimated density f{x) is guaranteed to be a density. Note that the weight 

in (jl]) is an example of such choice. Suitable weight functions help overcome problems 

with bumps and discontinuity of the estimated density. For example, if is a gaussian 

distribution, the estimated density function / will be smooth and have derivatives of all 

orders. Table [1] presents some of the most used kernel functions and Figure |3] displays the 

format of the Epanechnikov, Uniform, Gaussian and Triweight kernels. 

Kenel weights 



Kernel weight 



Kfx) 



il(|a:|<l) 



Uniform 
Gaussian 

Epanechnikov ^ 
Biweight 

Triweight §(1 - x^)H{\x\ < 1) 

Tabela 1: Most common Kernel weights 
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Figura 3: Kernel weight functions 



One of the drawbacks of the kernel density estimation is that it is always biased, particu- 
larly near the boundaries (when the data is bounded). However, the main drawback of this 
approach happens when the underlying density has long tails. In this case, if the bandwidth 
is small, spurious noise appears in the tail of the estimates, or if the bandwidth is large 
enough to deal with the tails, important features of the main part in the distribution may be 
lost due to the over-smoothing. To avoid this problem, adaptive bandwidth methods have 
been proposed, where the size of the bandwidth depends on the location of the estimation. 
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See Section H] for more details on bandwidth selection. 



3.1 Properties of Kernel density estimators 

In this section, some of the theoretical properties of the kernel density estimator are 
derived, yielding reliable practical use. Assume we have Xi, . . . ,Xn i.i.d. random variables 
from a density / and let K{) be a Kernel weight function such that the following conditions 
hold 

J K{u)du = 1, J uK{u)du = 0, and J u^K{u)du = ^i2{K) > 0. 
Then, for a non-random h, the expected value of f{x) is 

It is easy to see that / is an asymptotic unbiased estimator of the density, since E{f{x)) — )■ 
f{x) J K{y)dy = f{x) when /i — 0. It is important to note that the bandwidth strongly 
depends on the sample size, so that when the sample size grows, the bandwidth tends to 
shrink. 

Now, assume also that the second derivative /" of the underlying density / is absolutely 
continuous and square integrable. Then, expanding f{x + yh) in a Taylor series about x we 
have 

fix - yh) = fix) - hyf'ix) + ^hYfix) + oih') 
Then, using the conditions imposed on the Kernel, the bias of the density estimator is 

Bzasifix)) = ^f"ix)f^,iK) + oih') (8) 
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The variance of the estimated function can be calculated using steps similar to those in 

©: 

Var{f{x)) = l^jK\y)f\x-hy)dy-^{EU\x)))' 

= \f K\y){f{x) + o{l)}dy = -{/(x) + o(l)}^ 
nn J n 

= \ l K\y)dyf{x) + o(^ 
nn J \nn J 

= i-ii(A-)/W+<,(± 

where R{g) = J g'^{y)dy for any square integrable function g. From the definition of Mean 
Square Error (MSE), we have 

MSE{f{x)) = jCf{x)-f{x)fdx = Var{f{x)) + Bias\f{x)) 

It is straightforward to see that, in order for the kernel density estimation to be consistent 
for the underlying density, two conditions on the bandwidth are needed as n — )■ oo: /i — )■ 
and nh — )■ oo. When these two conditions hold, MSE{f{x)) — 0, and we have consistency. 
Moreover, the trade-off between bias and variance is controlled by the MSE, where decreasing 
bias leads to a very noise (large variance) estimate and decreasing variance yields over- 
smoothed estimates (large bias). As has already been pointed out, the smoothness of the 
estimate depends on the smoothing parameter h, which is chosen as a function of n. For the 
optimal asymptotic choice of h, a closed form expression can be obtained from minimizing 
the Mean Integrated Square Error (MISE). Integrating the MSE over the entire line, we find 
(Parzen (1962)) 

MISEif) =E [ (/(.) - mfdx = ^'^''^(''^^(f'l (9) 

J nh 4 

and the bandwidth h that minimizes MISE is then 

'>MISE^(;|^)"«-. (10) 
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Using this optimal bandwidth, we have 



MMISEif) ^ I [f,l{K)R\K)Rin] 



1/5 



n 



4/5 



(11) 



A natural question is how to choose the kernel function K to minimize (fTT]) . Interestingly, 
if we restrict the choice to a proper density function, the minimizer is the Epanechnikov 
kernel, where fil{K)R\K) = 3^5^ 

The problem with using the optimal bandwidth is that it depends on the unknown 
quantity which measures the speed of fluctuations in the density /, i.e., the roughness of 
/. Many methods have been proposed to select a bandwidth that leads to good performance 
in the estimation, some of these are discussed in Section |H 

The asymptotic convergence of the kernel density estimator has been widely explored. 
Bickel and Rosenblatt (1973) showed that for sufficiently smooth / and K, sup^ |/(a;) — 
V /(•^)' when normalized properly, has an extreme value limit distribution. The strong 
uniform convergence of / 



has been studied extensively when the observations are independent or weakly dependent. 
Nadaraya (1965) showed that if K is of bounded variation and if / is uniformly continuous, 
then ( IT^ holds as long as J2m>i e'^™'^" < oo for each 7 > 0. Moreover, Stute (1982) derives 
a law of the logarithm for the maximal deviation between a kernel density estimator and the 
true underlying density function, Gine and Guillou (2002) find rates for the strong uniform 
consistency of kernel density estimators and Einmahl and Mason (2005) introduce a general 
method to prove uniform in bandwidth consistency of kernel-type function estimators. Other 
results on strong uniform convergence with different conditions can be found in several other 
papers, such as Parzen (1962), Bhattacharya (1967), Van Ryzin (1969), Moore and Yackel 
(1977), Silverman (1978) and Devroye and Wagner (1980) . 



lim sup \f{x) — f{x)\ = a.e. 



(12) 
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4 The choice of the smoothing parameter h 



Selecting an appropriate bandwidth for a kernel density estimator is of crucial importance, 
and the purpose of the estimation may be an influential factor in the selection method. In 
many situations, it is sufficient to subjectively choose the smoothing parameter by looking 
at the density estimates produced by a range of bandwidths. One can start with a large 
bandwidth, and decrease the amount of smoothing until reaching a "reasonable" density 
estimate. However, there are situations where several estimations are needed, and such 
an approach is impractical. An automatic procedure is essential when a large number of 
estimations are required as part of a more global analysis. 

The problem of selecting the smoothing parameter for kernel estimation has been explored 
by many authors, and no procedure has yet been considered the best in every situation. 
Automatic bandwidth selection methods can basically be divided in two categories: classical 
and plug-in. Plug-in methods refer to those that find a pilot estimate of /, sometimes 
using a pilot estimate of h, and "plug it in" the estimation of MISE, computing the optimal 
bandwidth as in ffTOj) . Classical methods, such as cross-validation. Mallow's Cp, AIC, etc, are 
basically extensions of methods used in parametric modeling. Loader (1999) discusses the 
advantages and disadvantages of the plug-in and classical methods in more detail. Besides 
these two approaches, it is possible to find an estimate of h based on a reference density. 
Next, we present in more detail the reference method and the most used automatic bandwidth 
selection procedures. 

4.1 Reference to a Distribution 

A natural way to overcome the problem of not knowing /" is to choose a reference density 
for /, compute /" and substitute it in ffTOj) . For example, assume that the reference density 
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is Gaussian, and a Gaussian kernel is used, then 



h 



MISE 



RiK) 



1/5 



n 



-1/5 



1/5 



n 



By using an estimate of a, one has a data-based estimate of the optimal bandwidth. In order 
to have an estimator that is more robust against outliers, the interquartile range R can be 
used as a measure of the spread. This modified version can be written as 

R 



^robust = l-06min (T,- 



34 



n 



-1/5 



Histogram of C02 




h_MISE 

— h robust 



10 20 30 40 50 

C02 in metric tons per capita in 2008 

Figura 4: Estimated density of C02 per capita in 2008 using the bandwidth that minimizes 
MISE(solid line) and the robust bandwidth(dashed line). 



Figure m shows the estimated density of C02 per capita in the year of 2008. The data set 
can be found at http://data.worldbank.org/indicator/EN.ATM.C02E.PC/countries Note 
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that the estimated density that was computed with the robust bandwidth captures the 
peak that characterizes the mode, while the estimated density with the bandwidth that 
minimizes MISE smoothes out this peak. This happens because the outhers at the tail of 
the distribution contribute to Hmise be larger than the robust bandwidth hrobust- For more 
details on this estimator, see Silverman (1986) or Hardle (1991). 

These methods are of limited practical use, since they are restricted to situations where a 
pre-specified family of densities is correctly selected. Plug-in and classical methods, described 
below, do not suffer from this limitation. 



4.2 Plug-in Methods 

There are several papers that address the plug-in approach for bandwidth selection. Some 
of them study different ways to estimate R{f"), others explore ideas on how to select a pilot 
bandwidth to better estimate R{f"). The idea is that the only unknown part of ( fTOl) needs 
to be estimated, and hence the bandwidth estimator /ijvjjgjr; can be obtained. 

Scott, Tapia and Thompson (1977) proposed a sequential process: calculate R{f") = 
R{fh^{x)), plug R{f") into ffTOj) to obtain h^, and iterate until convergence of the bandwidth. 
Hall and Marron (1987) proposed estimating ^(/(f^) by Rifjf^) = Rifh^) - SSrr- Parzen 
and Marron (1990) modified this idea, estimating R(f^P^) = R{fg^^^) — ^^p+i with g having 
the optimal rate given in Hall and Marron (1987). An improvement of Parzen and Marron 

(1990) method can be found in Sheather and Jones (1991). Hall, Sheather, Jones and Marron 

(1991) proposed to use a kernel of order 2 and to take one extra term in the Taylor expansion 
of the integrated square bias, leading to 

MISE2(/z) = ^ + ^f,l{K)R{f") - -f^2{K)^^,{K)R{n. (13) 

Since the minimizer of f|T3l) is not analytically feasible, they proposed to estimate the 
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bandwidth by 

Several other plug-in methods have been proposed, and a review of the first procedures 
that address this type of methodology can be found in Turlach (1993). Modern research 
on plug-in methods have actually become somewhat hybrid, combining ideas of plug-in and 
classical approaches such as cross validation, see Biased Cross- Validation described below for 
example. More recently, inspired by developments in threshold selection, Chan, Lee and Peng 
(2010) propose to choose h = 0{n^^^^) as large as possible, so that the density estimator has a 
larger bias, but smaller variance than fhAMssix)- The idea is to consider an alternative kernel 
density estimator / = ^,Y.UK [^) and define A„(x; /^) = f.^^H^IX^^,^.^. - 
Then, the choice for the smoothing parameter is 

hi = argmin{h : |A„(x;r)| > for all r > h,r G [cn~^^^'"' ']}, 

where Za denotes a critical point in A^(0, 1), c > and < e < 1/5. The intuition is that, 
when h is large A„(x; h) > Za, since A„(x; r) -4 A^(0, 1). 



4.3 Classical Methods 

4.3.1 Least Squares Cross- Validation 

Cross-validation is a popular and readily implemented heuristic for selecting the smo- 
othing parameter in kernel estimation. Introduced by Rudemo (1982) and Bowman (1984), 
least squares cross-validation is very intuitive and has been a fundamental device in recent 
research. The idea is to consider the expansion of the Integrated Square Error (ISE) in the 
following way 

ISE{h) = J fl{x)dx- j h{x)f{x)dx + j f{x)dx. 
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Note that the last term does not depend on fh, hence on h, so that we only need to consider 
the first two terms. The ideal choice of bandwidth is the one which minimizes 

L{h) = ISE{h) - J f\x)dx = j fl{x)dx - j f{x)f{x)dx. 

The principle of the least squares cross-validation method is to find an estimate of L{h) from 
the data and minimize it over h. Consider the estimator 



(14) 



where 



The summation in f |T^ has expectation 

^^5^ A,-4^*) = Efh,-n{Xn) =E j fH,-n{x)f{x)dx = E j fh{x)f{x)dx, 

because E{fh) depends only on the kernel and bandwidth, not on the sample size. It follows 
that E{CVis{h)) = E{L{h)), and hence CVisih) + j p{x)dx is an unbiased estimator of 
MISE (reason why this method is also called unbiased cross-validation) . Assuming that the 
minimizer of CVLs{h) is close to the minimizer of E[CVLs{h)), the bandwidth 

hiscv = arg mm CVLsih) 

h 

is the natural choice. This method suffers from sample variation, that is, using different 
samples from the same distribution, the estimated bandwidths may have large variance. 
Further discussion on this method can be found in Bowman, Hall and Titterington (1984), 
Hall (1983) and Stone (1984). 

4.3.2 Biased Cross- Validation 

Biased cross-validation considers the asymptotic MISE 



AMISEih} = ^ + ^iil{K)RU"). 
nn 4 
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This method was suggested by Scott and Terrell (1987), and its main idea is to replace the 
unknown quantity R{f") by the estimator 



R{f") = R{f;:)-{nh'r'R{K"' 



to give 



BCV{h) = {nh)-'R{K) + ^^il{K)R{f). 

Then, the bandwidth selected is hscv = arg minh BCV{h). This selector is considered a 
hybrid of cross-validation and plug-in, since it replaces an unknown value in AMISE by a 
cross- validatoin kernel estimate R{f"). 



4.3.3 Likelihood Cross- Validation 

Suppose that in addition to the original data set Xi, . . . , Xn, we have another independent 
observation X* from /. Thinking of fh as a parametric family depending on h, but with fixed 
data Xi, . . . ,Xn, we can view log/(X*) as the likelihood of the bandwidth h. Because in 
reality no additional observation is available, we can omit a randomly selected observation 
from the original data, say X^, and compute fh-ii^i), as in (ITSl) . Note that there is no 
pattern when choosing the observation to be omitted, so that the score function can be 
taken as the log likelihood average 

n 

CV{h) = n-'J2^ogU^.,{X,). 

i=l 

Naturally, we choose the bandwidth the minimizes CV{h), which is known to minimize the 
Kullback-Leibler distance between fh{x) and f{x). This method was proposed by Habbema, 
Hermans and van den Broek (1974) and Duin (1976), but other results can be found in 
Marron (1987), Marron (1989) and Cao, Cuevas and Gonzalez-Manteiga (1994). 
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In general, bandwidths chosen via cross validation methods in kernel density estima- 
tion are highly variable, and usually give undersmooth density estimates, causing undesired 
spurious bumpiness. 



4.3.4 Indirect Cross-validation 

The Indirect Cross-validation (ICV) method, proposed by Savchuk, Hart and Sheather 
(2010), slightly outperforms least squares cross-validation in terms of mean integrated squa- 
red error. The method can be described as follows. First define the family of kernels 
L = a,a) : a > 0,a > 0} where, for all u, L{u; a, a) = (1 -|- a)(j){u) — ^(p (^) . Note 
that this is a linear combination of two gaussian kernels. Then, select the bandwidth of 
an L-kernel estimator using least squares cross-validation, and call it bucv- Under some 
regularity conditions on the underlying density /, /i„ and 6„ that asymptotically minimize 
the MISE of and L-kernel estimators, have the following relation 



The indirect cross-validation bandwidth is chosen to be hjcv = Cbucv- Savchuk et al. 
(2010) show that the relative error of ICV bandwidths can converge to at a rate of n^/^, 
much better than the n^/^*^ rate of LSCV. 

4.4 Other Methods 

4.4.1 Variable Bandwidth 

Rather than using a single smoothing parameter h, some authors have considered the 
possibility of using a bandwidth h{x) that varies according to the point x at which / is 
estimated. This is often referred as the balloon estimator and has the form 





(16) 
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The balloon estimator was introduced by Loftsgaarden and Quesenberry (1965) in the form 
of the kth nearest neighbor estimator. In Loftsgaarden and Quesenberry (1965), h{x) was 
based on a suitable number k, so that it was a measure of the distance between x and the kth 
data point nearest to x. The optimal bandwidth for this case can be shown to be (analogue 
of (do]) for asymptotic MSE) 



Another variable bandwidth method is to have the bandwidth vary not with the point 
of estimation, but with each observed data point. This type of estimator, known as sample- 
point or variable kernel density estimator, was introduced by Breiman, Meisel and Purcell 
(1977) and has the form 



This type of estimator has one advantage over the balloon estimator: it will always integrate 
to 1, assuring that it is a density. Note that h{Xi) is a function of random variables, and 
thus it is also random. 

More results on the variable bandwidth approach can be found in Hall (1992), Taron, 
Paragios and Jolly (2005), Wu, Chen and Chen (2007) and Cine and Sang (2010). 

4.4.2 Binning 

An adaptive type of procedure is the binned kernel density estimation, studied by a few 
authors such as Scott (1981), Silverman (1982) and Jones (1989). The idea is to consider 
equally spaced bins Bi with centers at ti and bin counts Ui, and define the estimator as 




(17) 




(18) 



fbin{x) = - riiK 



n . 



^ ^ i=i 



( 



h 



) 



(19) 
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where the sum over m means summing over the finite non-empty bins that exist in practice. 
It is also possible to use a variable bandwidth in (fTOjl . yielding the estimator 

Examples of other approaches and discussion on this type of estimation can be found 
in Hall and Wand (1996), Cheng (1997), Minnotte (1999), Pawlak and StadtmuUer (1999), 
Holmstrom (2000). 



4.4.3 Bootstrap 

A methodology that has been recently explored is that of selecting the bandwidth using 
bootstrap. It focuses on replacing the MSE by MSE*, a bootstrapped version of MSE, which 
can be minimized directly. Some authors resample from a subsample of the data Xi, . . . , Xn 
(see Hall (1990)), others replace from a pilot density based on the data (see Faraway and Jhun 
(1990), Hazelton (1996), Hazelton (1999)), more precisely, from f^{x) = ELi ^ (^)' 
where L is another kernel and 6„ is a pilot bandwidth. Since the bandwidth choice reduces 
to estimating s in h = n'^^^s, Ziegler (2006) introduces fn,s{x) = ^^375^ J27=i ( n-vs's ) ' 
obtain M^E; ,(x) = E*((/* ,(x) - /^(x))^). The proposed bandwidth is 

s ' 

Applications of the bootstrap idea can be found in many different areas of estimation, 
see Delaigle and Gijbels (2004), Loh and Jang (2010) for example. 



4.4.4 Estimating Densities on R+ 

It is known that kernel density estimators have larger bias on the boundaries. Many 
methods have been proposed to alleviate such problem, such as the use of gamma kernels 
or inverse and reciprocal inverse gaussian kernels, also known as varying kernel approach. 
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Chen (2000) proposes to replace the symmetric kernel by a gamma kernel, which has flexible 
shapes and locations on ]R_|.. Their estimator can be described in the following way. Suppose 
the underlying density / has support [0, oo) and consider the gamma kernel 

K^/b+iM = 5x/6+ir(a;/6+l)' 

where 6 is a smoothing parameter such that — )■ and nh ^ oo. Then, the gamma kernel 
estimator is defined as 



1 

p{x) = -y^K,/b+i,b{Xi). 

The expected value of this estimator is 

j'OO 

Ep{x) = / K,/,+Mf{y)dy = Ef{^^ 
Jo 



where is a Gamma(x/b+l,b) random variable. Using Taylor Expansion and the fact that 
E{^x) = X + b and Var{^x) = xh + we have that 

Ef{i.) = f{x + b)+^-f"{x)Var{^,) + o{b) 



fix) + ^-xfix) 



o{h). 



It is clear then, that this estimator does not have bias problems on the boundaries, since the 
bias is 0{h) near the origin and in the interior. See Chen (2000) for further details. Other 
approaches on estimating the density on M^. can be found in Scaillet (2004), Mnatsakanov 
and Ruymgaart (2012), Mnatsakanov and Sarkisian (2012), Comte and V.Genon-Catalot 
(2012) and references therein. 

Some interest on density estimation research is on bias reduction techniques, which can 
be found in Jones, Linton and Nielsen (1995), Choi and Hall (1999), Cheng, Choi, Fan and 
Hall (2000), Choi, Hall and Roussan (2000) and Hall and Minnotte (2002). Other recent 
improvements and interesting applications of the kernel estimate can be found in Hirukawa 
(2010), Liao, Wu and Lin (2010), Matuszyk, Cardew-Hall and Rolfe (2010), Miao, Rahimi 
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and Rao (2012), Chu, Liau, Lin and Su (2012), Golyandina, Pepelyshev and Steland (2012) 
and Cai, Rushton and Bhaduri (2012) among many others. 

4.4.5 Estimating the distribution function F{x) 

It is not uncommon to find situations where it is desirable to estimate the distribution 
function F{x) instead of the density function f{x). A whole methodology known as kernel 
distribution function estimation (KDFE) has been explored since Nadaraya (1964) introdu- 
ced the estimator 

i=i ^ ' 

where K is the distribution function of a positive kernel /c, i.e K{x) = k(t)dt. Authors 
have considered many alternatives for this estimation, but the basic measures of quality or 
this type of estimator are 

ISE{h) = j [Fh{x) - F{x)fW{x)dF{x) and 
MISE{h) = E j[Fh{x)~F{x)fW{x)dF{x), 

where is a non-negative weight function. 

Sarda (1993) considered a discrete approximation to MISE, the average squared error 

1 " 

ASE{h) = - Y}Fh{Xi) - F{Xi)fW{Xi). 

i=l 

He suggests replacing the unknown F{Xi) by the empirical F„(Xj) and then selecting the 
bandwidth that minimizes the leave-one-out criterion 

1 " 

CV{h) = - V[F,,_,(X,) - F„(X,)]2l^(X,). 
n ^-^ 

i=l 

As an alternative to this cross-validation criterion, Altman and Leger (1995) introduce 
a plug-in estimator of the asymptotically optimal bandwidth. There is a vast literature 
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on estimating kernel distribution functions, for example Bowman, Hall and Prvan (1998), 
Tenreiro (2006), Ahmad and Amezziane (2007), Janssen, Swanepoel and Veraberbeke (2007), 
Berg and Politis (2009), just to cite a few. 

4.5 Example of Bandwidth Selection Methods 

It is well known that plug-in bandwidth estimators tend to select larger bandwidths when 
compared to the classical estimators. They are usually tuned by arbitrary specification of 
pilot estimates and most often produce over smoothed results when the smoothing problem 
is difficult. On the other hand, smaller bandwidths tend to be selected by classical methods, 
producing under smoothed results. The goal of a selector of the smoothing parameter is to 
make that decision purely from the data, finding automatically which features are important 
and which should be smoothed away. 

Figure [5] shows an example of classical and plug-in bandwidth selectors for a real data set. 
The data corresponds to the exports of goods and services of countries in 2011, representing 
the value of all goods and other market services provided to the rest of the world. The data 
set can be downloaded from the world bank website ( jhttp: / / data.worldbank.org ) . 



The plug-in estimators a) rule of thumb for Gaussian and b) Seather and Jones selector pro- 
duced a very smooth fit, while unbiased cross-validation selects a small bandwidth, yielding 
a highly variable density estimate. The hybrid method biased cross-validation, is the one 
that selects the largest bandwidth, hence its corresponding density estimate is very smooth, 
smoothing away information of the peak (mode). 
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Figura 5: Estimated densities for bandwidths chosen using different methods. 

5 SiZer 



In nonparametric estimation, the challenge of selecting the smoothing parameter that 
yields the best possible fit has been addressed through several methods, as described in 
previous sections. The challenge is to identify the features that are really there, but at the 
same time to avoid spurious noise. Marron and Chung (1997) and other authors noted that 
it may be worth to consider a family of smooths with a broad range of bandwidths, instead 
of a single estimated function. Figure [6] shows an example of a density generated from a 
mixture of a Gaussian variable with mean and variance 1 and another Gaussian variable, 
with mean 8 and variance 2. The density was estimated with a Epanechnikov kernel using 
bandwidths that vary from 0.4 to 10. The wide range of smoothing considered, from a small 
bandwidth producing a wiggly estimate to a very large bandwidth yielding nearly the simple 
least squares fit, allows a contrast of estimated features at each level of smoothing. The two 
highlighted bandwidths are equal to 0.6209704 and 1.493644, corresponding to the choice of 
biased cross-validation (blue) and to Silverman's rule of thumb (red) (see Silverman, 1986) 
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respectively 




The idea of considering a family of smooths has its origins in scale space theory in computer 
science. A fundamental concept in such analysis is that it does not aim at estimating 
one true curve, but at recovering the significant aspects of the underlying function, since 
different levels of smoothing may reveal different intrinsic features. Exploring this concept 
in a statistical point of view, Chaudhuri and Marron (2000) introduced a procedure called 
Slignificance ZERo crossings of smoothed estimates (SiZer), whose objective is to analyze 
the visible features representing important underlying structures for different bandwidths. 
Next, we briefly describe such method. 
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Suppose that h & H, where if is a subinterval of (0,oo), and x G /, where / is a 
subinterval of (—00,00). Then the family of smooth curves {fh{x)\h ^ H,x & 1} can be 
represented by a surface called scale space surface, which captures different structures of the 
curve under different levels of smoothing. Hence, the focus is really on E{fh{x)) as h varies 
in H and x in J, which is called in Chaudhuri and Marron (2000) as "true curves viewed at 
different scales of resolution" . 

A smooth curve fhix) has derivatives equal to at points of minimum (valleys), maximum 
(peaks) and points of inflection. Note that, before a peak (or valley), the sign of the derivative 
dfh{x)/dx is positive (or negative), and after it the derivative is negative (or positive). In 
other words, peaks and valleys are determined by zero crossings of the derivative. Actually, 
we can identify structures in a smooth curve by zero crossings of the mth order of the 
derivative. Using a Gaussian kernel K{x) = (l/-\/27r)ea;p(— a;^/2), Silverman (1981) showed 
that the number of peaks in a kernel density estimate decreases monotonically with the 
increase of the bandwidth, and Chaudhuri and Marron (2000) extended this idea for the 
number of zero crossings of the mth order derivative d"^ fh{x) / dx"^ in kernel regression. 

The asymptotic theory of the scale space surfaces and their derivatives studied by Chaudhuri 
and Marron (2000), which hold even under bootstrapped or resampled distributions, provides 
tools for building bootstrap confidence intervals and tests of significance for their features 
(see Chaudhuri and Marron (1999)). SiZer basically considers the null hypothesis 

i/o"'" : d'^E{Ux))/dx'^ = 0, 

for a fixed x G / and h E H. If H^'^ is rejected, there is evidence that d"^ E{fh{x)) / dx''^ is 
positive or negative, according to the sign of d"^ fh{x) / dx"^ . 

The information is displayed in a color map of scale space, where the pixels represent 
the location of x (horizontally) and h (vertically). The regions are shaded blue for sig- 
nificant increasing curve, red for significantly decreasing, purple for unable to distinguish 
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and gray for insufficient data. Note that purple is displayed when the confidence interval 
for the derivative contains 0. There are a few options of software available, including Java 



(http:/ /www. wagner.com/SiZer/SiZerDownload. html ), matlab (http:/ /vml.cas. unc.edu/stat- 



or/webspace/miscellaneous/marron/Matlab7Software/Smoothing/) and R (SiZer package). 
Figure [7] shows an example of a color map obtained with SiZer. The data is the GDP 



per person employed in 2010 (downloadable at http://data.worldbank.org). It is easy to 
see that for large bandwidths, the density function significantly increases until about 16000, 
then after a small area that SiZer is unable to distinguish, it has a significant decrease, hence 
estimating a density with one mode at around 16000. Small bandwidths produce a map that 
is mostly gray, meaning that the wiggles in the estimate at that level of resolution can not be 
separated from spurious sampling noise. An interesting blue area appears, with a mid-level 
resolution, near 43000, indicating a slightly significant increase. This comes after and before 
a purple area, which SiZer is unable to distinguish if it is increasing or decreasing. Thus, 
with a mid-level bandwidth, the estimated density would suggest 2 modes, one somewhere 
near 10000 and another near 43000. 



Family of Smooths 

DOOOOBt 




690 10299.63 19909.26 29516,89 39128,52 48738.15 53347,78 



Figura 7: SiZer example 
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